1. ABSTRACT

We describe an architecture for object modeling and recognition for an
autonomous land vehicle. Examples of objects of interest include terrain features,
fields, roads, horizon features, trees, etc. The architecture is organized around a
set of databases for generic object models and perceptual structures, temporary
memory for the instantiation of object and relational hypotheses, and a long term
memory for storing stable hypotheses that are affixed to the terrain
representation. Multiple inference processes operate over these databases. We describe
these particular components: the perceptual structure database, the grouping
processes that operate over this, schemas, and the long term terrain database.
We conclude with a processing example that matches predictions from the long
term terrain model to imagery, extracts significant perceptual structures for
consideration as potential landmarks, and extracts a relational structure to update
the long term terrain database.
2. INTRODUCTION


Terrain and object models for autonomous land vehicles (ALVs) are
required for a wide range of applications including route and tactical planning,
location verification through the recognition of terrain features and objects, and
acquiring new information about the environment as it is explored. The following
lists important criteria for terrain and object modeling capabilities.


Descriptive Adequacy: The modeling technique should be capable of
describing the objects and situations in the environment necessary for the vehicle
to function. This includes representing natural as well as man-made objects. It
should be a consistent representation that supports modular system development
and uniform inference procedures that can operate over different types of objects
at different levels of detail. Uniform shape, object subpart and surface attribute
affixments are necessary to do this.


Recognition Adequacy: Much of the activity of an ALV is concerned with
determining where it is and what is around it. Terrain models should be
manipulable for determining the anticipated appearance of world objects and for
controlling recognition processing. This involves the formation of general
predictions of sensor derived features from the terrain model. Such predictions will
often be uncertain and qualitative due to incomplete prior knowledge of the
terrain.


Handling Uncertainty: The existence and exact environmental location of
objects will often not be known with complete certainty. Locations will often be
determined relative to other known locations and not with respect to a globally
consistent terrain map. This is true, for instance, when the sensor displacement
parameters are not well determined. It is necessary to represent this uncertainty
explicitly in the terrain model so incrementally acquired information can be used
for disambiguation.

Learning: A vehicle will learn about the environment as it moves through 
it. Associating new information with the terrain representation should be
straightforward. This is difficult to do, for example, by changing values in a raw
elevation array. Types of information to be affixed to the terrain representation 
include newly discovered objects, details of expected objects, and the processing 
used in object recognition. 

Fusion of Information: The ALV must build a consistent environmental
model over time from different sensors. As an object is approached, its image 
appearance and scale will change considerably, yet it has to be recognized as the 
same object, with newly acquired information associated with the unique instance 
of the general object type. In a typical situation, a distant dark terrain patch 
will be partially recognized based upon distinctive visual characteristics, but may 
be either a building or a road segment. As it is approached, its image appearance 
changes considerably, making disambiguation possible. This requires the 
representation of multiple hypotheses, each formatted with respect to the
properties of the potential world objects. The structure of the object description should
direct the accumulation of information. 

A further consideration in developing and evaluating terrain modeling
capabilities is that there is not a single ALV. Instead, there is a wide range of
autonomous vehicles, indexed by a diverse range of active and passive sensors and
assumptions about a priori data. There is a continuum from systems having a
complete initial model of the terrain and perfect sensors to those with no a priori
model and highly imperfect sensors. For example, a robot with no a priori data
and only an unstabilized optical sensor will probably model the environment in
terms of a sequence of views related by landmarks and distinct visual events
embedded in a representation that is more topological than metric. An ALV
solely dependent on optical imagery will have to deal with the huge variability in
the appearance of objects. Experience has shown that even road surfaces have
highly variable visual characteristics. Alternatively, a few pieces of highly
preselected visual information can serve to verify predictions from a reliable and
detailed terrain model and precise position and range sensors.

We call a general object model a schema. A schema can represent
perceived, but unrecognized, visual events, as well as recognized objects and their
relationships in environmental scenes. The architectural design is focused about
the representation, instantiation, and inference over schemas developed by the
ALV as it moves through the environment. Schemas are related to similar
concepts found in [Hanson et al. - 78] and [Ohta - 80]. The short term terrain
representation consists of schema instantiations that represent accumulated
perceptual evidence for objects as attributes and relations that are hypothesized
with varying levels of certainty.

Object models are used to organize perceptual processing by integrating
descriptive representations with recognition and segmentation control. One
aspect of this is the use of different types of attributes and inheritance relations
between generic schemas for representation in IS-A and PART-OF hierarchies. A
particular object attribute relates three dimensional world properties of an object
and sensor dependent view information, either by a set of generic views or
viewing procedures. These viewing attributes are also inherited and modified
according to different object types. In many systems, objects are treated as lists
of attributes that are matched against extracted image features. Here they are
treated as specifying an active control process that directs image segmentation by
specifying grouping procedures to extract and organize image structures.

Another critical aspect of the architecture is the various types of spatial
localization relations that deal with uncertainty and learning by associating
different types of perceptually derived information with terrain models. For
example, local (multi-sensor) viewframes affix sets of schemas and unrecognized
perceptual structures into local “robot's-eye” views of an ALV's environment.
Path-affixments between local viewframes support fusion of information in time
without necessarily corresponding to locations in an a priori grid.

This effort has developed an architecture for terrain and object recognition 
compatible with the wide range of potential sensor configurations and the 
different qualities of a priori data. 

There has been work in artificial intelligence, computer vision, and graphics
that satisfies the individual requirements for object modeling capabilities, but little
has been done to integrate them. To date, there is no vision system that can
interpret general natural scenes, although some can deal with restricted
environments [Hanson et al. - 78] while other systems are restricted to artificial objects
and environments. Brooks' [Brooks - 84] representation based on generalized
cylinders meets, or could be extended to deal with, many of these functions. It
has well defined shape attribute inheritance between a set of progressively more
complex object models, and affixment relations that could be generalized to
handle uncertainty. It can also be used to generate constraints on image features
from object models. Nonetheless, the system built around this representation has
had limited success beyond dealing with essentially orthographic views of
geometrically well defined man-made objects. This appears to be partially
because the constraints on image structures generated from the abstract instances
of object models are too general to generate initial correspondences between
models and image structures. Brooks' system also used an impoverished set of
image descriptions, and the object models could not direct the segmentation
process directly during their instantiation. The majority of work in terrain modeling
deals with how well a representation can realistically model three dimensional
terrain, but not how it is used for recognition. The simplicity of a model that is
described by a few parameters is not useful for recognition unless it can direct
constrained searches against image data. For example, Pentland's [Pentland - 83]
use of fractals satisfies aspects of descriptive adequacy for natural terrain, but has
been less effective for recognition. Kuipers [Kuipers - 82] has produced an
interesting terrain model for learning and handling uncertainty, but it is
non-visual. Related to this is Kuan's [Kuan - 84] object based terrain representation
for planning that is organized in terms of distinct, modifiable objects, but is also
not associated with sensor derived processing results.


3. ARCHITECTURE OVERVIEW 


The system architecture consists of several databases and inference
processes. The inference processes transform the databases, creating additional
data structures, and modifying the existing ones. The task interface focuses
attention in system processing and monitors progress toward system task goals.
This high level architecture is depicted in Figure 1. The boxes with square 
corners in this figure represent databases, the ellipses represent inference 
processes, and arrows indicate dataflow. 


3.1 SYSTEM DATABASES 

At the highest level there are three databases. These are the short term 
memory (STM), long term memory (LTM), and generic models. 

The STM acts as a dynamic scratchpad for the vision system. It has two 
sub-areas, a perceptual structures database (PSDB) and a hypothesis space. The 
PSDB includes incoming imagery from sensors, immediate results of extracting 
image structures such as curves, regions and surfaces, spatial temporal groupings 
of these structures, and results of inferring 3D information. 

The hypothesis space contains statements about objects and terrain in the
world. A hypothesis is represented as an instantiated schema. The schema 
points to the various perceptual structures in the PSDB that provide evidence 
that the object represented by the schema (such as a terrain patch, road, tree, 
etc.) exists in the world. Other types of hypotheses include grids, viewframes, 
and viewpaths. Grids are a special type of terrain representation that contain 
elevation information and are derived from range data or successive depth maps 
from motion stereo. Viewpaths, as partially ordered sequences of viewframes,
give space time relationships between hypotheses. Viewframes are sets of
hypotheses that correspond to what can be seen from a localized position. A 
hypothesis with no associated perceptual structures is a prediction. As structures 
and localization are incrementally added to a hypothesis, it progresses on the con- 
tinuum from predicted to recognized. Hypotheses that have enough evidence 
associated with them to be considered recognized and stable, are moved to the 
LTM. 
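
The progression of a hypothesis from predicted to recognized can be sketched as follows. This is a minimal illustration, not the system's implementation: the class name `Hypothesis`, the intermediate "partial" label, and the evidence threshold `recognized_at` are assumptions introduced here, since the text does not specify how "enough evidence" is measured.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    schema: str                                    # generic schema name, e.g. "road"
    evidence: list = field(default_factory=list)   # ids of supporting PSDB structures
    localized: bool = False                        # has a location been attached?

    def status(self, recognized_at: int = 3) -> str:
        # A hypothesis with no perceptual structures is a prediction.
        if not self.evidence:
            return "prediction"
        # Assumed rule: localized and at least `recognized_at` structures.
        if self.localized and len(self.evidence) >= recognized_at:
            return "recognized"
        return "partial"


def promote_stable(stm: list, ltm: list) -> None:
    """Move recognized hypotheses from short term to long term memory."""
    for h in [h for h in stm if h.status() == "recognized"]:
        stm.remove(h)
        ltm.append(h)
```

A real system would replace the simple count with a stability test over time, but the control flow (accumulate evidence in the STM, then move stable instances to the LTM) follows the description above.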

Figure 1: Terrain Modeling and Recognition System Architecture

The LTM stores a priori terrain representations, the long term terrain
database, and hypotheses with enough associated evidence to be considered visually
stable. A priori data concerning elevation and terrain type information, as well
as knowledge of specific landmarks, are stored in the LTM. A viewframe
representing a certain location in the world is stored in the LTM if the evidence
associated with it could be re-used to recognize the local environment if it were
re-encountered. Consistency of one hypothesis with another is not required for
storage in the LTM.

The model space stores generic object models, the inheritance relations of 
the (model) schema network, and a set of image structure grouping processes and 
rules for evaluating image structure interestingness. Generic models are used 
dynamically to instantiate and guide search processes to associate evidence to an 
object instance. Inheritance relations are used by various schema inference
procedures to propagate structures, attributes and relations between object
instantiations. For instance, the generic two-lane-road schema has an “IS-A”
relationship to the generic road schema. It follows, based on the inheritance models, that
an instantiation of the two-lane-road schema will inherit the more general charac- 
teristics of the generic road schema that in turn inherits the more general charac- 
teristics of a terrain patch. Unlike the STM and LTM, the model space is not 
modified by inference processes. 


3.2 INFERENCE PROCESSES 

At the highest level, there are five different sorts of inference processes in 
the vision system. These are perceptual inference, location inference, object 
instantiation, LTM STM instantiation, and the task interface. 

The PSDB is initialized with the output of standard multi-resolution image 
processing operations for smoothing, edge extraction, flow field determination, 
etc. Much subtler inference is required for grouping processes that produce
connected curves, textures, surfaces, and temporal matches between image
structures. These grouping operations are typically model guided. There are generic
models (which may be task dependent) of what constitutes “interestingness” of
an image structure.

The hypothesis inference processes produce tasks for the perceptual
processes. These may be satisfied by simple queries over the PSDB such as “find
all long lines in this region of the image”, where “long”, “line” and “region” are
suitably interpreted. Queries can be more complex, requiring, for instance, temporal
stability, such as “find all homogeneous green texture regions that are matched
(i.e., remain in the field of view) over at least two seconds of imagery”, where,
again, qualitative descriptors are rigorously defined. Alternatively, the requested
perceptual structures may be dynamically extracted. In this case, a history of the
processing attempts and results is maintained. If similar requests are made
later, such as if we were to view the same environment from a different
perspective, these processing histories could be used to recall a processing sequence that
produced successful results.
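
As an illustration of the simple query case, the sketch below interprets “find all long lines in this region” over a list of line structures. The `Line` type, the rectangular region, and the `min_length` threshold are assumptions made for the example; the text only requires that “long”, “line” and “region” be suitably interpreted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def length(self) -> float:
        return math.hypot(self.x1 - self.x0, self.y1 - self.y0)


def find_long_lines(psdb, region, min_length=20.0):
    """Return lines fully inside `region` whose length is at least `min_length`.

    `region` is (xmin, ymin, xmax, ymax) in image coordinates (an assumption).
    """
    xmin, ymin, xmax, ymax = region

    def inside(x, y):
        return xmin <= x <= xmax and ymin <= y <= ymax

    return [
        line for line in psdb
        if line.length >= min_length
        and inside(line.x0, line.y0)
        and inside(line.x1, line.y1)
    ]
```

A temporal-stability query would add a second filter over matches between successive images; that part is omitted here.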

Location processes include a number of different modes of spatial location 
representation and inference. While exact location information is used when it is 
available, a key concept is the qualitative representation of relative location. 
This is fundamental, because the problem of acquiring terrain knowledge from 
moving sensors involves handling perceptual information that arises from
multiple coordinate systems that are transforming in time. The basic approach
to location inference is to represent the location of world objects in a qualitative 
manner that does not require the full knowledge of continuous transformations of 
sensor coordinates relative to the vehicle the sensors are mounted on, or of
transformations of vehicle coordinates relative to the terrain. 

The main structures involved in location inference are viewframes,
viewpaths, and grids. Viewframes represent both metric location information 
about world objects derived from range sensors and view-based location informa- 
tion about the directions in which objects are found derived from passive sensor 
data. 


Generic schemas are models of world objects that include information and 
procedures on how to predict and match the object models in the available sensor 
data. Besides representing 3D geometric constraints, 2D-3D sensor view appear- 
ance including effects of change in resolution and environmental effects such as 
season, weather, etc., schemas also indicate contextual relationships with other 
objects, type and spatial constraints, similarity and conflict relations, spatial 
localization, and appearance in viewframes. 

Object schema instantiation may occur by model-driven prediction from a 
priori knowledge, or directly from another instantiation and a PART-OF relation. 
Instantiation may also occur by matching a distinctive perceptual
structure to a schema appearance instance. This sort of “triggering” is more
common in situations where there is little a priori information to guide predic- 
tion. Object instantiations generate queries to the PSDB grouping process in 
order to complete matching. 

A key idea in object instantiation processing is inference over the model 
schema network hierarchies. Direct representation and inference over a large 
enough body of world objects to accomplish outdoor terrain understanding 
requires very large memory and proportionately lengthy inference procedures over 
that memory space. Hierarchical representation makes a significant reduction in
storage requirements; furthermore, it lends itself naturally to matching schema to
world objects at multiple levels of abstraction, thus speeding the inference pro- 
cess. Two basic hierarchies are the IS-A and PART-OF trees. 

IS-A hierarchies represent the refinement of object classification. Figure 2
shows part of an IS-A hierarchy for terrain representation. The level of
terrestrial-object tells us that we will not see evidence of any schema instance
below this node as perceptual structures surrounded by sky. At the level of
terrain-patch we pick up the geometric knowledge of adherence to the ground
plane, while information stored at the level of a road schema constrains the
boundaries of a terrain patch to be locally linear (with other constraints). Types
beneath road add critical appearance constraints in color and texture, while the
final refinement level in the IS-A hierarchy, the number of lanes, further
constrains size parameters inherited from the road schema.
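
The inheritance of constraints down an IS-A chain can be sketched as below. The `Schema` class and every constraint name and value (for example `width_m` and the `paved-road` node) are hypothetical placeholders chosen to mirror the levels described above; they are not taken from Figure 2.

```python
class Schema:
    """A generic schema that inherits constraints from its IS-A parent."""

    def __init__(self, name, is_a=None, **constraints):
        self.name = name
        self.is_a = is_a
        self.local = constraints

    def constraints(self):
        # More specific levels override or tighten inherited values.
        inherited = self.is_a.constraints() if self.is_a else {}
        return {**inherited, **self.local}


# Illustrative values only.
terrestrial = Schema("terrestrial-object", surrounded_by_sky=False)
patch = Schema("terrain-patch", terrestrial, on_ground_plane=True)
road = Schema("road", patch, boundary="locally-linear", width_m=(2.0, 30.0))
paved = Schema("paved-road", road, texture="smooth")
two_lane = Schema("two-lane-road", paved, lanes=2, width_m=(6.0, 8.0))
```

Calling `two_lane.constraints()` yields the ground-plane and boundary constraints from the upper levels together with the narrowed width range, which is the storage saving the hierarchical representation is meant to provide.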

PART-OF hierarchies represent the decomposition of world objects into 
components, each of which is, itself, another world object. Figure 3 shows a
PART-OF hierarchy decomposition for a generic 2-lane-road. PART-OF
hierarchies contain relative geometric information that is useful in prediction and
search. 
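
To show how relative geometry in a PART-OF hierarchy supports prediction, the sketch below propagates lateral offsets from a parent to its parts. The part names and offsets are invented for illustration and do not reproduce Figure 3; a single lateral coordinate stands in for full 3D relative placement.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    name: str
    offset_m: float = 0.0                       # lateral offset relative to the parent
    parts: list = field(default_factory=list)   # sub-components (PART-OF children)


def predict(part, origin=0.0):
    """Yield (name, absolute lateral position) for a part and all of its sub-parts."""
    position = origin + part.offset_m
    yield part.name, position
    for child in part.parts:
        yield from predict(child, position)


# Hypothetical decomposition of a generic two-lane road.
road = Part("2-lane-road", 0.0, [
    Part("left-lane", -1.75, [Part("left-shoulder", -2.5)]),
    Part("center-line", 0.0),
    Part("right-lane", 1.75),
])
```

Once one component (say the center line) has been matched in the image, the predicted positions of the remaining parts constrain where the search for them should look.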
Figure 2: IS-A Hierarchy


Figure 3: PART-OF Hierarchy


As object instantiation inference reasons up and down schema network 
hierarchies, incrementally matching perceptual structures and other data to 
instances of object appearance in the world, a history mechanism records the 
inference processing steps, parameters and results. This dynamic data structure 
is called the schema instantiation structure. One important aspect of this
structure is that it can be used to extract the inference and processing sequence(s) that
worked earlier to see the same object, or ones that are similar. This accounts for 
the fact that distinctiveness in image appearance is an idiosyncratic process that 
depends upon many factors which are difficult to model and control, such as 
current motion, wind, varying outdoor illumination, etc. 


4. PERCEPTUAL PROCESSING AND THE PSDB 


Perceptual processing is concerned with organizing images into meaningful
chunks. The definition of “meaningful” and the development of explicit criteria
to evaluate segmentation techniques require, from a data-driven perspective,
that the chunks have characterizing properties, such as regularity, connectedness,
and not tending to fragment the image. From a model-driven point of view,
segmentation appropriateness corresponds to the extent to which the pieces can be
matched to structures and predictions derived from object models. From either
perspective, a basic requirement is that image segmentation procedures find
significant image structures, independent of world semantics, in order to initialize
and cue model matching. This allows for the extraction of world events such as
surfaces, boundaries, and interesting patterns independent of understanding
perceptions in the context of a particular object. These, in turn, are useful
abstractions from image information to match against object models or describe the
characteristics of novel objects.

The Perceptual Structure Data Base (PSDB), conceptualized in Figure 4,
contains several different types of information. These are classified as images,
perceptual objects, and groups. Images are the arrays of numbers obtained from
the different sensors and the results of low level image processing (such as
contour extraction and region growing routines) that produce such arrays. It is
difficult for the symbolic relational representations used for object models, such
as schemas, and the processing rules in computer vision systems, to work directly
with an array of numbers. Therefore, there are many spatially-tagged, symbolic
representations used in image understanding systems that describe extracted
image structures such as the primal sketch [Marr - 82], the RSV structure of the
VISIONS system [Hanson et al. - 78], and the patchery data structure of Ohta
[Ohta - 80]. We found it useful to build such a representation around a set of
basic perceptual objects corresponding to points, curves, regions, surfaces, and
volumes.

Groupings are recursively defined to be a related set of such objects. The 
relation may be exactly determined, as in representing which edges are directly 
adjacent to a region, or it may require a grouping procedure to determine the
set of objects that satisfy the relationship. Groupings can occur over space, e.g., 
linking texture elements under some shape criteria such as compactness and den- 
sity, or over time, as in associating instances of perceptual structures in succes- 
sive images. We stretch the concept a bit, so that groupings also refer to general 
non-image registered perceptual information, such as histograms. 


4.1 INITIALIZATION OF THE PSDB 

Whenever new sensor data is obtained, a default set of operations is
performed to initialize the PSDB. Edges are extracted at multiple spatial
frequencies and decomposed into linear subsegments. The edges are extracted into
distinct connected curves, and general attributes such as average intensity, contrast,
and variance are associated with them. Similar processing is performed for
region extraction. Histograms are computed with respect to a wide range of
object based and image based characteristics in a pyramid like structure. These 
default operations are used to initialize bottom-up grouping processes and schema 
instantiations. These, in turn, determine significant structures using heuristic 
interestingness rules to prioritize the structures for the application of grouping 
processes or object instantiations. 


4.2 IMAGES 

Images are the data arrays derived from the optical and laser range sensors
and the results of image processing routines for operations including
histogram-based segmentation, different edge operators, optic flow field computations, and
so forth. Associated with images are several attributes for time of acquisition,
relevant sensor parameters, etc. The processing relationship structure keeps
track of the processing history of all objects in the PSDB.


4.3 PERCEPTUAL OBJECTS 

Points, curves, regions, surfaces, and volumes are basic types of perceptual
structures that are accessible to object instantiations and grouping processes. An
example instance of a curve structure is shown in Figure 5. This figure shows
many common representational characteristics of perceptual objects. There are
default attributes associated with particular objects, such as endpoints, length
and positions for a curve. There is also an associated attribute-list mechanism
for incorporating more general properties with an object. This list is accessible
by keywords and a general query mechanism using methods specific to the
particular associated attribute. The associated attributes in the example are shown in
capital letters. There are many types of attributes that can be consistently
associated with a curve using this mechanism.

Figure 4: Perceptual Structure Data Base (PSDB)

Figure 5: Curve Example

Figure 6: Parallel Grouping

A useful representation for performing geometric operations and queries
over objects is the OBJECT LABEL GRID (or GRID: in the example curve; the
number 6 indicates the index of this structure). This is an image where each
pixel contains a vector of pointers back to the set of perceptual objects and 
groups which occupy that position. This allows geometric operations to be per- 
formed directly on the grid. Filtering operations can be applied to the OBJECT 
LABEL GRID to restrict processing based upon attributes associated with 
objects. Various types of masks can be associated with objects to reflect a direc- 
tional or uniform neighborhood to determine object relationships in the OBJECT 
LABEL GRID. 
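
A label grid of this kind can be sketched with a sparse mapping from pixel coordinates to object identifiers. The class below is an assumption-laden illustration: it uses a Python set per pixel in place of the vector of pointers described above, and the `keep` predicate stands in for attribute-based filtering.

```python
from collections import defaultdict


class LabelGrid:
    """Sparse label grid: each occupied pixel maps to the objects covering it."""

    def __init__(self):
        self.cells = defaultdict(set)   # (x, y) -> ids of objects at that pixel

    def add(self, obj_id, pixels):
        for p in pixels:
            self.cells[p].add(obj_id)

    def query(self, mask, keep=lambda obj_id: True):
        """Return ids of objects under any pixel of `mask` that pass the `keep` filter."""
        found = set()
        for p in mask:
            found |= {o for o in self.cells.get(p, ()) if keep(o)}
        return found
```

A directional or uniform neighborhood is then just a particular choice of `mask` pixels around an object, so the same query serves both kinds of relationship test.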
4.4 GROUPS


A group is a set of related perceptual objects. The relation can be deter- 
mined directly by a query over an object and those surrounding it, as in finding 
the set of curves within some distance of a given region. Alternatively, it may 
require a search process to find the set of objects meeting some, potentially com- 
plex, criteria. For example, an ordered set of curves can be grouped together 
using thresholds on allowable changes in the average contrast and orientation of 
successive elements. By expressing the grouping process as a search over a state 
space of potential groups, each group becomes a potential hypothesis in the 
PSDB. Groups can also reflect temporal relationships; this occurs in matching 
structures in successive images. A relational grouping procedure is shown in Fig- 
ure 6 for the determination of nearby parallel lines with opposite contrast direc- 
tions. This is done for a linear segment by first extracting nearby neighbors 
using a narrow mask oriented perpendicular to the segment at its mid-point.
The intersection of this mask with points in the label grid is determined, and
then each candidate is evaluated by checking if it is within allowable thresholds 
for length, contrast, and orientation. It is then ordered with respect to the smal- 
lest magnitude of the difference vector computed from the average gradients. 
The grouping processes can either produce the best candidate as a potential 
grouping, or some set of them. 
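
The parallel-line procedure can be approximated as follows. This is a sketch under several assumptions: the perpendicular mask is replaced by a direct geometric test on segment midpoints, the thresholds (`max_gap`, `max_angle`, `max_len_ratio`, `min_contrast`) are invented defaults, and "opposite contrast" is scored by how nearly the two average gradient vectors cancel.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    mid: tuple        # (x, y) midpoint
    angle: float      # orientation in radians, in [0, pi)
    length: float
    gradient: tuple   # average intensity gradient (gx, gy) across the segment


def parallel_partners(seg, others, max_gap=10.0, max_angle=0.1,
                      max_len_ratio=2.0, min_contrast=1.0):
    """Rank nearby, roughly parallel segments of opposite contrast (best first)."""
    ux, uy = math.cos(seg.angle), math.sin(seg.angle)    # direction along seg
    ranked = []
    for c in others:
        dx, dy = c.mid[0] - seg.mid[0], c.mid[1] - seg.mid[1]
        along = abs(dx * ux + dy * uy)       # shift along the segment direction
        across = abs(-dx * uy + dy * ux)     # distance along the perpendicular
        # The perpendicular through seg.mid must cross the candidate, close by.
        if along > c.length / 2 or not 0 < across <= max_gap:
            continue
        dtheta = abs(seg.angle - c.angle)
        dtheta = min(dtheta, math.pi - dtheta)
        ratio = max(seg.length, c.length) / min(seg.length, c.length)
        if dtheta > max_angle or ratio > max_len_ratio:
            continue
        if math.hypot(*c.gradient) < min_contrast:
            continue
        # Opposite contrast: the two gradients should nearly cancel.
        score = math.hypot(seg.gradient[0] + c.gradient[0],
                           seg.gradient[1] + c.gradient[1])
        ranked.append((score, c))
    ranked.sort(key=lambda item: item[0])
    return [c for _, c in ranked]
```

Returning the whole ranked list lets the caller take either the best candidate or some set of them, matching the two output modes described above.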

Two different types of grouping processes have been developed: measure- 
based and interestingness-based. The measure based grouper is a generalization 
of established edge and region linkers [Martelli - 76]. It uses a measure consisting
of: 


1) some value to be optimized, such as length, minimal curvature,
compactness, or a composite scalar value

2) local constraints on allowable changes in attributes 

3) global thresholds on attributes 


The measure and associated constraints are optimized by a best first search 
returning several ordered candidate groups. The measure to be used can be asso- 
ciated with a prediction from an object model for substance or shape characteris- 
tics. The measure to be optimized can also be determined directly from initially 
extracted objects by selecting those that are extreme in some attribute or are 
correlated with the attributes of surrounding objects to derive a measure to be 
optimized. 
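
The three-part measure and its best-first optimization can be sketched with a priority queue. The function name, the callback signatures, and the `budget` cap on expansions are assumptions for this illustration; the code collects groups that cannot be extended further and returns the `k` highest-valued ones.

```python
import heapq
from itertools import count


def group_curves(seeds, neighbors, value, local_ok, global_ok, k=3, budget=1000):
    """Best-first search over ordered groups.

    value(group)   -> number to maximize (part 1 of the measure)
    local_ok(a, b) -> allowed change between successive elements (part 2)
    global_ok(a)   -> global attribute threshold on one element (part 3)
    """
    tie = count()   # tie-breaker so the heap never has to compare lists
    heap = [(-value([s]), next(tie), [s]) for s in seeds if global_ok(s)]
    heapq.heapify(heap)
    closed = []
    while heap and budget > 0:
        budget -= 1
        neg_value, _, group = heapq.heappop(heap)   # most promising group first
        extended = False
        for nxt in neighbors.get(group[-1], ()):
            if nxt not in group and global_ok(nxt) and local_ok(group[-1], nxt):
                longer = group + [nxt]
                heapq.heappush(heap, (-value(longer), next(tie), longer))
                extended = True
        if not extended:
            closed.append((-neg_value, group))
    closed.sort(key=lambda vg: vg[0], reverse=True)
    return closed[:k]
```

With a small `budget` the search still spends its effort on the highest-valued partial groups first, which is the practical benefit of best-first ordering.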

The measure based grouper is currently being generalized into one based on
interestingness. It involves the basic processing loop shown in Figure 7. Initially,
basic perceptual objects including curves, regions, junctions and their associated 
attributes are extracted using conventional techniques. Extracted objects are 
represented in label grids to express spatial neighborhood operations over the 
objects. A uniform neighborhood is established for each object, and directed rela- 
tions are formed with the adjacent objects in each neighborhood. These relations 
are represented in a small number of types of match relationships that contain 
descriptions of the correlation of attributes, subcomponent matching, and compo- 
site properties. 
Selected attributes of the extracted perceptual objects and the match
structures are then sorted into lists with pointers back to the associated objects.
These lists are for attributes such as size, average feature values, variance of
feature values, compactness, the extent of correlation between the components
and attributes of different structures, and the number of groups an object is
involved in. These different rankings are then combined using selection criteria
to choose the set of interesting perceptual objects and relationships. The
selection criteria set the required position in different subsets of the sorted attribute
lists. An example is to find the 100 largest objects that are in the top 10 of any of the
attribute correlation lists. The selection criteria are modifiable during processing and are
meant to reflect the influence of model-based predictions.
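
The example criterion (the largest objects among those ranked near the top of any correlation list) can be sketched as below. The function names and the attribute names used in the usage note are hypothetical; objects are assumed to be a mapping from an identifier to a dictionary of numeric attributes.

```python
def rank_lists(objects, attributes):
    """Map each attribute name to object ids sorted by decreasing attribute value."""
    return {
        a: sorted(objects, key=lambda o: objects[o][a], reverse=True)
        for a in attributes
    }


def select_interesting(objects, size_attr, other_attrs, n_largest, top_n):
    """Pick the `n_largest` objects by size among those in the top `top_n`
    of at least one of the `other_attrs` ranking lists."""
    ranks = rank_lists(objects, [size_attr, *other_attrs])
    in_any_top = set().union(*(ranks[a][:top_n] for a in other_attrs))
    return [o for o in ranks[size_attr] if o in in_any_top][:n_largest]
```

For the example in the text one would call `select_interesting(objects, "size", correlation_attrs, n_largest=100, top_n=10)`; a model-based prediction could change either number or the list of attributes during processing.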

Interestingness is used to focus the application of grouping rules to a 
selected set of objects and relations between objects indicated in match struc- 
tures. The grouping rules then combine perceptual objects to form new percep- 
tual objects, or groups, based upon the type of relation between the objects. 
Neighborhoods are established with respect to these derived groups to form new 
relationships. These in turn are sorted in the attribute lists with respect to the 
previously extracted perceptual objects. In addition to the relations established 
in uniform neighborhoods, for some groups, non-uniform relations are also esta- 
blished. Processing can continue indefinitely as less and less interesting relations 
become candidates for the application of grouping rules. Explicit criteria are 
needed to stop processing; e.g., we can limit processing time, determine when 
there is a uniform covering of the image with extracted groups, or when struc- 
tures belong to unique groups. 
Figure 7: Grouping Processing Flow

Figure 8: Grouping Architecture

These operations are performed by virtual processors called grouping nodes.
Grouping nodes are seen as covering regular and adjacent portions of an image 
area (not necessarily of a single image, because there can be multiple images in a 
motion sequence). The image area contains some portion of a label plane for 
accessing the objects based upon their spatial dispositions as well as object-based 
associated attributes. The grouping nodes are further organized in a hierarchical 
pyramid shown in Figure 8. Each node is connected to its adjacent neighbors 
and has a parent and descendants. The transfer of information between nodes at 
different levels is based upon interestingness. Lower level processes send their 
most interesting structures up the hierarchy. There are several effects of this. 
One is that it allows a uniform processing to occur at different levels, so grouping 
rules can be applied to objects at different levels of interestingness. It also allows 
relations between nonspatially adjacent structures to be handled in a uniform 
architecture. It also partitions perceptual structures in a way that corresponds to 
different levels of control in instantiation of object models. 
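
The upward transfer of interesting structures through the pyramid can be sketched as follows. This is a minimal sketch under stated assumptions: the `GroupingNode` class, the scalar interestingness score, and the fixed quota `k` are hypothetical, and the connections between adjacent nodes at the same level are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Structure = Tuple[str, float]  # (structure id, interestingness score)

@dataclass
class GroupingNode:
    name: str
    parent: Optional["GroupingNode"] = None
    children: List["GroupingNode"] = field(default_factory=list)
    structures: List[Structure] = field(default_factory=list)

    def add_child(self, child: "GroupingNode") -> None:
        child.parent = self
        self.children.append(child)

    def send_up(self, k: int) -> None:
        """Let descendants report first, then pass this node's k most
        interesting structures to its parent."""
        for child in self.children:
            child.send_up(k)
        if self.parent is not None:
            best = sorted(self.structures, key=lambda s: s[1], reverse=True)[:k]
            self.parent.structures.extend(best)

# Two leaf nodes covering adjacent image areas under one parent.
root = GroupingNode("root")
left = GroupingNode("left", structures=[("e1", 0.9), ("e2", 0.2), ("e3", 0.5)])
right = GroupingNode("right", structures=[("r1", 0.7), ("r2", 0.1)])
root.add_child(left)
root.add_child(right)
root.send_up(k=1)
root_ids = [sid for sid, _ in root.structures]
```

After the call, the parent holds one structure from each leaf, so a grouping rule applied at the parent can relate structures that are not spatially adjacent at the lower level.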

Organizing segmentation in terms of grouping processes has many advan- 
tages for a model based vision system. The grouping processes can be run 
automatically from extracted significant structures based upon perceptually 
significant, though non-semantic criteria. Thus, connected curves of slowly 
changing orientation or compact, homogeneous regions can be extracted purely 
on perceptual criteria. These image structures correspond to world structure and 
events, and they are useful for initializing schema instantiations. They 
correspond to the qualitative image predictions associated with more general 
schemas. An inference process that compiles an object model into grouping
processes allows model-based vision to have a very active character, quite
different from single-level attribute matching.


5. SCHEMAS 


Schemas represent hypotheses about objects in the world. The process of 
schema instantiation creates an instance of a schema together with evidence for 
that schema. Evidence consists of structures in the PSDB, a priori knowledge 
stored in the LTM, predictions derived from location inference, and relations to 
already instantiated schema. 

Table 1 shows the various slots and relationships in a generic schema. 
Although this data structure has a frame-like appearance, it is useful to view the 
schema as a semantic net structure, with slots representing nodes in the net and 
relationships representing arcs. Schema instantiation inference reasons from a 
(partially) instantiated node, follows arcs, and infers procedures to execute from 
the sum of its acquired information in order to obtain more evidence to further
instantiate the schema. 

The schema network is a generic set of data structures that indicate the a 
priori relationships between schemas. A key part of this network is the inheri- 
tance hierarchies that indicate which descriptions and relationships can be inher- 
ited from schema to schema. Inheritance hierarchies allow efficient matching of 
objects in the world against sensor evidence from progressively coarser to finer
levels. As reasoning moves from coarser to finer levels of description in model- 
based schema instantiations, the schemas inherit descriptive bounds and add new 
descriptions, and also add constraints to inherited ones. For example, the system
may first recognize an object as a terrain patch (because it lies on the ground
plane). A road is a type of terrain patch (see Figure 1) that adds a linear
boundary description and constrains the visual image appearance of the terrain
patch schema in the color and texture descriptors. The two basic types of schema
network inheritance hierarchies are IS-A and PART-OF.

Table 1: Generic Schema Data Structure
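
The way a schema inherits descriptive bounds along the IS-A hierarchy, adds new descriptions, and tightens inherited ones can be illustrated with a small sketch. The `Schema` class, the interval representation of a descriptor, and all descriptor names and values below are hypothetical and serve only to illustrate the road and terrain patch example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Bounds = Tuple[float, float]  # (low, high) interval on a descriptor value

@dataclass
class Schema:
    schema_type: str
    is_a: Optional["Schema"] = None
    descriptors: Dict[str, Bounds] = field(default_factory=dict)

    def effective_descriptors(self) -> Dict[str, Bounds]:
        """Inherit bounds from IS-A ancestors, then intersect with local ones."""
        result = dict(self.is_a.effective_descriptors()) if self.is_a else {}
        for name, (lo, hi) in self.descriptors.items():
            if name in result:
                parent_lo, parent_hi = result[name]
                lo, hi = max(lo, parent_lo), min(hi, parent_hi)
            result[name] = (lo, hi)
        return result

# Made-up descriptor intervals, for illustration only.
terrain_patch = Schema("terrain patch", descriptors={
    "height_above_ground_m": (0.0, 0.5),
    "texture_coarseness": (0.0, 1.0),
})
road = Schema("road", is_a=terrain_patch, descriptors={
    "texture_coarseness": (0.0, 0.3),   # constrains an inherited descriptor
    "boundary_linearity": (0.8, 1.0),   # adds a new descriptor
})
road_view = road.effective_descriptors()
```

Matching can then proceed from coarse to fine: evidence is first tested against the terrain patch bounds, and only then against the narrower bounds of the road schema.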

Below is a brief explanation of each of the slots and relationships in the 
generic schema data structure. Schema type refers to the generic name of the 
schema in the IS-A hierarchy. Schema name is the identification of the schema 
instance, e.g., if the schema type is “road” then the schema name might be
“highway 101”. The schema instantiation structure maintains the control history
of the schema recognition inference processes for this schema. 

The 3D description is an object-centered view of the world object
represented by the schema. It includes its 3D geometry and shape description,
actual size, and inherent color and texture (as opposed to how its color and
texture might appear to a particular sensor). Note that this is the description
that matches the schema-object before looking at its structure refined into
components. For example, the 3D geometric description of a tree schema does not
separate the canopy from the trunk, but gives a single enclosing volume as its
representation. The volumetric descriptions of the trunk and canopy appear as
the 3D descriptors on their schema further down the PART-OF hierarchy. Thus,
inferring down the PART-OF hierarchy corresponds to increasing the resolution
of the view of the object represented by the schemas.

The sensor views are descriptions of the stable or frequently occurring 
appearances of the schema object in imagery. This description is intended to be 
used for image appearance prediction, evidence accrual for instance recognition, 
3D shape inference, and location inference. The reason for storing explicit
(parametrized) image views, or generating them at runtime, is that the
perceptual evidence matches these descriptions, not the three dimensional ones.

The distinctive image appearance slot holds descriptions of perceptual 
structures that are likely to occur bottom-up in the PSDB. They provide coarse 
triggers for instantiating the schema object hypothesis without prediction. 

The perceptual structure is the dynamically created PSDB query history 
generated by the schema instantiation as it attempts to fill in evidence matching 
the various schema slots and relations. The instantiator can re-use successful 
branches of perceptual structures to improve its recognition speed as it continues 
to view other instances of the same generic schema type. 

Components are pointers to other schema that represent sub-parts of the
schema object. They are finer resolution descriptions of the schema, one level
down on the PART-OF hierarchy. The MUST-HAVE components are assumed
to be parts the represented object must have to exist, although the schema may
be instantiated without observing them all. Occasionally occurring components,
such as center-lines on roads, can be stored in the MAY-HAVE slot. Spatial
relationships between components as they make up the schema object are listed at
this level also. Relationships can also be stored on a view dependent basis.
These relationships access the sensor-view dependent data in that slot.
PART-OF's point upward one level on the PART-OF hierarchy, indicating that this
schema is a component of another schema.

Classification points upward and downward one level on the IS-A hierarchy.
There may be more than one such pointer, which is to say that the IS-A
hierarchy may be partially ordered.

Contextual relationships indicate spatial/temporal consonance or
disconsonance between groups of schema types, omitting those which are already
indicated in the PART-OF and IS-A hierarchies. Schema that ALWAYS or NEVER
occur with the given one can be used strongly for belief or disbelief in the
schema instance and as focus of attention mechanisms within the instantiation
process. SOMETIMES-occurs-with relationships are used to store the
spatial-temporal aspects of the schemas' relative appearance in the viewed
environment.


CONFUSED-WITH and SIMILAR-TO relationships indicate schema that
may be mistaken for the given one, but for different reasons. One schema may
be confused with another because they share common evidence pieces, but for
which there are sufficient descriptors to disambiguate. Two schema are similar if
there is sufficient ambiguity in their appearances, and therefore the available
perceptual evidence, that they may be indistinguishable without contextual
reasoning. For example, tall grass may be confused with wheat from coarse shape
and texture evidence, but can often be disambiguated by color descriptors or
finer resolution examination of structure (because of wheat berries, for
example). However, roads are similar to runways because they cannot necessarily
be distinguished by their intrinsic appearance, no matter how detailed or
accurate the descriptors and evidence. Contextual reasoning, e.g., the presence
of aircraft on the runway, global curvature of the road, etc., is required.

Locational information points at the various viewframes the schema appears 
in and inferred 3D relationships with other world objects. 

Recognition strategies are prioritization cues for the schema instantiation 
processes that suggest inference chains likely to pay off to match this schema 
instance against sensor evidence. 

The recognition strategies slot in the schema data structure prioritizes infer- 
ence approaches relevant to this schema. These approaches include search for 
components, search for part of schema instance, search on weaker classification, 
relations with other schema instances, and PSDB matching. 

Search for COMPONENTS and search for PART-OF are both inferences 
along the PART-OF hierarchy in different directions. The instantiator searches 
the relevant slot to see if there are components to search for or another object of 
which this schema is a component. If the COMPONENT or PART-OF schemas 
exist, they can be accessed to continue the inference. Otherwise, each causes an 
instantiation of the missing schema to be generated as a prediction. Instantiation 
control can be transferred at this point to the COMPONENT or PART-OF 
schema. The schema inference process maintains its thread of reasoning relevant 
to the schema in the schema instantiation structure slot. 
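
The search for COMPONENTS can be sketched as below; the search for PART-OF is symmetric and follows the pointer in the other direction. This is a simplified sketch, not the implemented instantiator: the `SchemaInstance` class, the `MUST_HAVE` table, and the use of a list of strings as the schema instantiation structure are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass(eq=False)
class SchemaInstance:
    schema_type: str
    predicted: bool = False  # True if generated as a prediction, not from evidence
    part_of: Optional["SchemaInstance"] = None
    components: Dict[str, "SchemaInstance"] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # stand-in for the instantiation structure

# Hypothetical generic knowledge: schema type -> MUST-HAVE component types.
MUST_HAVE = {"tree": ["trunk", "canopy"]}

def search_components(inst: SchemaInstance) -> List[SchemaInstance]:
    """Return the component instances of inst, generating any missing
    component as a predicted instantiation."""
    found = []
    for ctype in MUST_HAVE.get(inst.schema_type, []):
        comp = inst.components.get(ctype)
        if comp is None:
            comp = SchemaInstance(ctype, predicted=True, part_of=inst)
            inst.components[ctype] = comp
            inst.history.append(f"predicted component {ctype}")
        else:
            inst.history.append(f"found component {ctype}")
        found.append(comp)
    return found

# A tree whose trunk has already been instantiated from evidence.
tree = SchemaInstance("tree")
trunk = SchemaInstance("trunk", part_of=tree)
tree.components["trunk"] = trunk
comps = search_components(tree)
```

Control could then be transferred to the predicted canopy instance, while the `history` list on the tree keeps the thread of reasoning for the original schema.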


6. LONG TERM TERRAIN DATABASE


The long term terrain database is part of LTM. It stores the data
necessary for a mobile robot to perform vision-based navigation and guidance,
predict visual events, such as landmarks and horizon lines, and to update and
refine maps.


The long term terrain database contains a priori map data including
government terrain grids, elevation data, and schemas representing instances of
stable visual events recorded while traversing paths in the environment. The use
of a priori map and grid data to predict percepts and to help guide image
segmentation is shown in Section 7. The following presents a summary of a
structure for spatial representation and inference that enables a robot to
navigate and guide itself through the environment.

We first define the notion of a geographic “place” in terms of data about
visible landmarks. A place, as a point on the surface of the ground, is defined by
the landmarks and spatial relationships between landmarks that can be observed
from a fixed location. More generally, a place can be defined as a region in
space, in which a fixed set of landmarks can be observed from anywhere in the
region, and relationships between them do not change in some appropriate
qualitative sense. Data about places is stored in structures called viewframes,
boundaries and orientation regions.

Viewframes provide a definition of place in terms of relative angles and
angular error between landmarks, and very coarse estimates of the absolute range
of the landmarks from our point of observation. Viewframes allow the system to
localize its position in space relative to observable local landmark coordinate
systems. In performing a viewframe localization, observed or inferred data about
the approximate range to landmarks can be used. Errors in ranging and relative
angular separation between landmarks are smoothly accounted for. A priori map
data can also be incorporated. A viewframe is pictured in Figure 20.

A viewframe encodes the observable landmark information in a stationary
panorama. That is, we assume that the sensor platform is stationary long
enough for the sensor to pan up to 360 degrees, to tilt up to 90 degrees (or to
use an omni-directional sensor [Cao et al. - 86]), to recognize landmarks in its
field of view, or to buffer imagery and recognize landmarks while in motion.

A sensor-centered spherical coordinate system is established. It fixes an 
orientation in azimuth and elevation, and takes the direction opposite the current 
heading as the zero degree axis. Then two landmarks in front of the vehicle, 
relative to the heading, will have an azimuth separation of less than 180 degrees. 
If we assume that no two distinguished landmark points have the same elevation 
coordinates (i.e., no two distinguished points appear one directly above the other)
then a well-ordering of the landmarks in the azimuth direction can be generated. 
We can speak of the landmarks as being “ordered from left to right”. The rela- 
tive solid angle between two distinguished landmark points is now well defined. 

Under the above assumptions, the system can pan from left to right,
recognizing landmarks, $L_i$, and storing the solid angles between landmarks in
order, denoting the angle between the i-th and j-th landmarks by
$\mathrm{Ang}_{ij}$. The basic viewframe data are these two ordered lists,
$(L_1, L_2, \ldots)$ and $(\mathrm{Ang}_{12}, \mathrm{Ang}_{23}, \ldots)$. The
relative angular displacement between any two landmarks can be computed from
this basic list. In [Levitt et al. - 87] we show how to use this data to
essentially parametrize all possible triangulations of our location relative to a
set of simultaneously visible landmarks. This localizes the robot's position in
space relative to a local landmark coordinate system.
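
As a small illustration of how the relative angular displacement between any two landmarks follows from the basic list, the sketch below sums the stored adjacent angles. It assumes the angles are simple azimuth separations in degrees between consecutive landmarks in left-to-right order; the function name `relative_angle` and the sample values are hypothetical.

```python
from typing import Sequence

def relative_angle(adjacent_angles: Sequence[float], i: int, j: int) -> float:
    """Angle from landmark i to landmark j (0-based, i <= j), obtained by
    summing the stored adjacent angles between consecutive landmarks."""
    if not 0 <= i <= j <= len(adjacent_angles):
        raise ValueError("landmark indices out of range or out of order")
    return sum(adjacent_angles[i:j])

# A viewframe with four landmarks and three adjacent angles (degrees).
landmarks = ["L1", "L2", "L3", "L4"]
angles = [20.0, 35.0, 50.0]  # Ang_12, Ang_23, Ang_34

ang_13 = relative_angle(angles, 0, 2)
ang_24 = relative_angle(angles, 1, 3)
ang_14 = relative_angle(angles, 0, 3)
```

With n landmarks only n-1 angles need to be stored, yet all pairwise displacements remain recoverable.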

Viewframes contain two basic dimensions of data: the relative angles
between landmarks, and the estimated range (intervals) to the landmarks. If we 
drop the range information, we are left with purely topological data. That is, it 
is impossible, using only the relative angles between landmarks, and no range, 
map or other metric data, to determine the relative angles between triples of 
landmarks, or to construct parametric representations of our location with 
respect to the landmarks. Nonetheless, there is topological localization informa- 
tion present in the ordinal sequence of landmarks: there is a sense in which we 
can compute differences between geographic regions, and observe which region we 
are in. 

The basic concept is to note that if we draw a line between two (point)
landmarks, and project that line onto the (possibly not flat) surface of the
ground, then this line divides the earth into two distinct regions. If we can
observe the landmarks, we can observe which side of this line we are on. The
“virtual boundary” created by associating two observable landmarks together
thus divides space over the region in which both landmarks are visible. We call
these landmark-pair-boundaries (LPB's), and denote the LPB constructed from
the landmarks $L_1$ and $L_2$ by LPB($L_1$, $L_2$).

Roughly speaking, if we observe that landmark $L_1$ is on our left hand, and
landmark $L_2$ is on our right, and the angle from $L_1$ to $L_2$ (left to right)
is less than 180 degrees, then we denote this side of, or equivalently, this
orientation of, the LPB by $[L_1 L_2]$. If we stand on the other side of the
boundary, LPB($L_1$, $L_2$), “facing” the boundary, then $L_2$ will be on our left
hand and $L_1$ on our right and the angle between them less than 180 degrees,
and we can denote this orientation or side as $[L_2 L_1]$ (left to right).

More rigorously, define:

$$
\text{orientation-of-LPB}(L_1, L_2) = \operatorname{sign}(\pi - \theta_{12}) =
\begin{cases}
+1 & \text{if } \theta_{12} < \pi \\
\phantom{+}0 & \text{if } \theta_{12} = \pi \\
-1 & \text{if } \theta_{12} > \pi
\end{cases}
$$

where $\theta_{12}$ is the relative azimuth angle between $L_1$ and $L_2$ measured
in an arbitrary sensor-centered coordinate system. Here, an orientation of +1
corresponds to the $[L_1 L_2]$ side of LPB($L_1$, $L_2$), -1 corresponds to the
$[L_2 L_1]$ side of LPB($L_1$, $L_2$), and 0 corresponds to being on
LPB($L_1$, $L_2$). It is straightforward to show that this definition of LPB
orientation does not depend on the choice of sensor-centered coordinate system.
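
The orientation test is simple enough to sketch directly. The function below is an illustration under stated assumptions: azimuths are given in radians and increase from left to right, $\theta_{12}$ is taken as the azimuth of $L_2$ relative to $L_1$ wrapped into $[0, 2\pi)$, and the tolerance `tol` used to decide that we are on the boundary is a hypothetical parameter.

```python
import math

def lpb_orientation(az1: float, az2: float, tol: float = 1e-9) -> int:
    """Return sign(pi - theta_12), where theta_12 is the azimuth of L2
    relative to L1, measured left to right and wrapped into [0, 2*pi)."""
    theta12 = (az2 - az1) % (2.0 * math.pi)
    diff = math.pi - theta12
    if abs(diff) <= tol:
        return 0  # standing on the boundary itself
    return 1 if diff > 0 else -1

# L1 on the left, L2 on the right, 90 degrees apart: the [L1 L2] side.
side_a = lpb_orientation(0.0, math.pi / 2)
# Same landmarks seen from the other side: the [L2 L1] side.
side_b = lpb_orientation(math.pi / 2, 0.0)
# Landmarks exactly 180 degrees apart: on the LPB.
on_line = lpb_orientation(0.0, math.pi)
# Rotating the sensor-centered coordinate system leaves the result unchanged.
rotated = lpb_orientation(1.3, 1.3 + math.pi / 2)
```

Only the difference of the two azimuths enters the computation, which is why a rotation of the sensor-centered coordinate system has no effect on the result.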

LPB's give rise to a topological division of the ground surface into observ- 
able regions of localization, called orientation regions. Crossing boundaries 
between orientation regions leads to a qualitative sense of path planning based on 
perceptual information. The three levels of spatial representation given by map 
or metric data, viewframes and orientation regions are pictured in Figure 9.

Figure 9: Multiple Levels of Spatial Representation

A natural environmental representation based on viewframes recorded while
following a path is given by two lists, one list of the ordered sequence of
viewframes
collected on the path, and another of the set of landmarks observed on the path. 
We call the viewframe list a viewpath. The landmark list acts as an index into 
the viewpath, each landmark pointing at the observations of itself in the 
viewframes. For efficiency, the landmark list can be formed as a database that 
can be accessed based on spatial and/or visual proximity. Visual proximity can 
be observed, or computed from an underlying elevation grid and a model of sen- 
sor and vision system resolution. 

The first occurrence of a landmark points at the instantiated schema or per- 
ceptual structure in the vision system database that was used to gather evidence 
in the landmark recognition process. After that, all recognized re-occurrences of
this landmark point back at this initial instance. The same is true for the first 
occurrences and successful re-recognition of LPB’s and viewframes. This mechan- 
ism allows multiple visual path representations, built at different times, to be 
incrementally integrated together as they are acquired by using a common land- 
mark indexing pointer list. 
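
The viewpath and its landmark index can be sketched as a pair of simple containers. This is a minimal sketch, not the implemented database: the `Viewpath` class and its methods are hypothetical, a viewframe is reduced to its left-to-right landmark names, and angles, ranges, and pointers to instantiated schemas are omitted.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

class Viewpath:
    """Ordered viewframes plus a landmark list that indexes into them."""

    def __init__(self) -> None:
        self.viewframes: List[Tuple[str, ...]] = []
        # landmark -> list of (viewframe index, position within that viewframe)
        self.landmark_index: Dict[str, List[Tuple[int, int]]] = defaultdict(list)

    def add_viewframe(self, landmarks_left_to_right: Sequence[str]) -> int:
        """Append a viewframe and record each landmark observation in the index."""
        frame_id = len(self.viewframes)
        self.viewframes.append(tuple(landmarks_left_to_right))
        for pos, name in enumerate(landmarks_left_to_right):
            self.landmark_index[name].append((frame_id, pos))
        return frame_id

    def first_occurrence(self, landmark: str) -> Tuple[int, int]:
        """Initial instance that all later re-occurrences point back to."""
        if landmark not in self.landmark_index:
            raise KeyError(landmark)
        return self.landmark_index[landmark][0]

path = Viewpath()
path.add_viewframe(["peak", "tower", "tree"])
path.add_viewframe(["tower", "tree", "barn"])
tower_observations = path.landmark_index["tower"]
tree_first = path.first_occurrence("tree")
```

Because every observation is keyed by landmark, a second viewpath built at a later time could be merged by appending its observations to the same index entries.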

We use an environmental representation for orientation-region reasoning 
that is a list of oriented LPB’s encountered and crossed in the course of following 
a path. We call such a list an orientation-path. As with viewpaths, there is an
associated landmark list that indexes into the orientation-path. 

A dynamically acquirable environmental representation that merges the
representations for viewpaths and orientation-paths consists of an ordered list
interspersing viewframes, LPB crossings, and appearance and occlusion (or loss of
resolution) of landmarks, as well as recording the headings taken in the course of
following the path over which the environmental map is being built. Thus, we
can integrate the representations required for viewframe and orientation region
based reasoning with heading and landmark information to formulate an
environmental representation that supports hybrid strategies for navigation and
guidance. The representation is formed at runtime and consists of multiple
interlocking, sequential, time-ordered lists of visual events that include those
necessary for the navigation and guidance algorithms presented in
[Levitt et al. - 87].


7. PROCESSING EXAMPLE 


The following processing example demonstrates the behavior of some
implemented system components. These include the format of predictions from the
long term terrain model, the extraction of perceptually significant groupings from
the PSDB, how an instantiated schema uses grouping processes and queries over
the PSDB, and extracting relevant cues for making viewframe localizations in the
long term terrain representation.

Figure 10 shows the elevation contours and road network in the a priori
terrain data from the Martin Marietta ALV test site in Denver, which was supplied
by the U.S. Army Engineer Topographic Laboratories (ETL). The vehicle position
on the road is indicated by the arrow in the figure. From this, we are able
to roughly determine the correspondence between an image taken from the road
and the terrain data (the relevant sensor parameters were not available). Figure
11 shows the terrain and feature classification supplied with the a priori data.
These correspond to sets of image overlays in register with the elevation data.
The road network is stored as a set of curve objects that is decomposed into
linear segments with supplied attributes, such as road material and width.
Terrain patches are extracted as regions from terrain type information and
parametric surface fits to the a priori elevation data.

Figure 12: Predicted Segmentation From Grid Data

Figure 12 shows how the grid registered terrain data is instantiated into
STM to form a predicted segmentation. The grid data regions from connected
analysis correspond to schema instances in the long term terrain memory.
Established surface display techniques are used to project the elevation with the
associated schema instances to form a predicted view. Image positions are then
labeled with their associated schema instances. Additionally, there are many
schema instances, ordered by depth, at the corresponding image locations. The
resulting predicted segmentation is processed as an abstract image where critical
perceptual events are determined by size, adjacencies across occlusion boundaries,
or types of terrain with high semantic contrast, such as water, fields, or
man-made structures. The perceptual structures are merged together based upon
distances and semantic type to yield predictions at different resolutions.

Figure 13 shows the predicted terrain patches for the vehicle positioned 
with respect to the terrain in Figure 10. Figure 14 shows the predicted segmenta- 
tion after filtering to pull out the horizon line and road terrain discontinuities for 
roads near the vehicle. This data is quite coarse (30m sampling), and image areas 
in the foreground are highly composite containing instances of road and the adja- 
cent grassy fields. Nonetheless, the predicted segmentation yields a qualitative 
description of predicted image features that is sufficient to initialize and direct 
grouping processes to find corresponding image features and relationships. The 
key characteristics of the predicted segmentation are that the vehicle is on a flat 
plane, and that its field of view consists of road and grassy field terrain patches 
with some mountains in the distance. Predictions of the dirt road off to the right
and the intersection are made from the road-network and the elevation informa- 
tion stored along with it. The predictions are in terms of constraints on region 
adjacencies across boundaries, and the shape and attributes, such as color con- 
trasts, of the boundaries themselves. The horizon line constraints are that it will 
tend to have smoothly changing orientation and be adjacent to a large homogene- 
ous region (the sky). In general, the predicted features are described with con- 
strained attributes determined from the visibility components of schemas. 

Figures 15 and 16 show some of the contour related structures in the
initialized PSDB. Figure 15 shows the edges extracted at one spatial resolution
using the Canny edge operator [Canny - 83]. We have found it useful not to apply
noise suppression to extracted segments in order to base filtering on structural
properties of the contours, including linear deviation and relationships to other
image structures. Different linear segment fits for this extracted edge image are
shown in Figure 16.

Figure 17 shows the results of grouping processes applied to a set of
selected curves in Figure 12 with multiple associated attributes for orientation 
and color contrasts. The grouping processes were constrained by the predicted 
segmentation in Figure 14 using constraints on allowable color contrasts, changes 
in linear segment orientation, and rough image position and extent. Multiple 
groups are obtained for each predicted image event. Selection of one, or main- 
taining multiple alternative groups, is explicitly represented in the schema instan- 
tiation structure. Here, groups were selected based upon length and uniformity 
of composite attributes. 

Figure 18 shows the results of a road schema instantiation based upon
matches to extracted road boundaries and accounting for road surface properties
through PART-OF relations. Texture elements adjacent to the road boundary
which are consistent with a road surface, such as low contrast, parallel edges
corresponding to tread marks, are used to direct queries to instantiate potential
road area. Queries are also used to determine the presence of anomalous
structures in the road such as anything which is high contrast or oriented
perpendicular to the road direction. Such structures require disambiguation
through instantiation of another schema (it could be a road marking) cued by the
anomaly or elevation estimates derived from motion displacements or range
sensing.

Significant image structures near the horizon line are particularly important 
for landmark extraction. Figure 19 shows extracted interesting perceptual groups 
near and above the horizon line. Figure 20 shows an extracted viewframe
representing the relative visual spatial relationships between some of the objects
extracted from this field of view. 









Figure 18: Road Schema Instantiation    Figure 19: Significant Perceptual Groups

Figure 20: Viewframe Instance


8. SUMMARY


The architecture we have developed, using terrain and road schemas with
implemented system components for perceptual processing and manipulating long
term terrain data, has been successfully used in tasks for ALV navigation and
scene interpretation.